
Conversation

@aheizi
Contributor

@aheizi aheizi commented May 13, 2025

Related GitHub Issue

#3555

Description

Currently, roo-code reads files as UTF-8 by default. When a file is encoded in GBK or another encoding, this produces garbled text (mojibake).

Test Procedure

Manual testing:

  1. Set VSCode's default encoding to GBK
  2. Let roo-code read and then edit this file

Type of Change

  • 🐛 Bug Fix: Non-breaking change that fixes an issue.
  • ✨ New Feature: Non-breaking change that adds functionality.
  • 💥 Breaking Change: Fix or feature that would cause existing functionality to not work as expected.
  • ♻️ Refactor: Code change that neither fixes a bug nor adds a feature.
  • 💅 Style: Changes that do not affect the meaning of the code (white-space, formatting, etc.).
  • 📚 Documentation: Updates to documentation files.
  • ⚙️ Build/CI: Changes to the build process or CI configuration.
  • 🧹 Chore: Other changes that don't modify src or test files.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Code Quality:
    • My code adheres to the project's style guidelines.
    • There are no new linting errors or warnings (npm run lint).
    • All debug code (e.g., console.log) has been removed.
  • Testing:
    • New and/or updated tests have been added to cover my changes.
    • All tests pass locally (npm test).
    • The application builds successfully with my changes.
  • Branch Hygiene: My branch is up-to-date (rebased) with the main branch.
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Changeset: A changeset has been created using npm run changeset if this PR includes user-facing changes or dependency updates.
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

before:
(screenshots)

after:
(screenshots)

Documentation Updates

Does this PR necessitate updates to user-facing documentation?

  • No documentation updates are required.
  • Yes, documentation updates are required. (Please describe what needs to be updated or link to a PR in the docs repository).

Additional Notes


Important

Introduces readFileWithEncoding to handle multiple file encodings, replacing fs.readFile in key tools to prevent garbled text issues, and adds necessary dependencies.

  • Behavior:
    • Introduces readFileWithEncoding in readFileWithEncoding.ts to handle multiple file encodings, including UTF-8, UTF-16, and GBK.
    • Replaces fs.readFile with readFileWithEncoding in applyDiffTool.ts, insertContentTool.ts, searchAndReplaceTool.ts, and DiffViewProvider.ts to prevent garbled text issues.
  • Dependencies:
    • Adds iconv-lite and jschardet to package.json for encoding detection and conversion.
  • Misc:
    • Updates extract-text.ts to use readFileWithEncoding for non-binary files.
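For reference, a minimal sketch of what such a helper might look like, assuming only the jschardet and iconv-lite APIs named above (everything else here is illustrative, not the PR's actual code):

```typescript
import * as fs from "fs/promises"
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Illustrative sketch, not the PR's implementation.
export async function readFileWithEncoding(filePath: string): Promise<string> {
	const buffer = await fs.readFile(filePath)

	// jschardet guesses the encoding from the raw bytes (e.g. "UTF-8", "GB2312").
	const detected = jschardet.detect(buffer)
	const encoding = detected?.encoding?.toLowerCase()

	// Decode with iconv-lite when the guess is usable; otherwise fall back to UTF-8.
	if (encoding && iconv.encodingExists(encoding)) {
		return iconv.decode(buffer, encoding)
	}
	return buffer.toString("utf-8")
}
```

The actual change layers candidate encodings and scoring on top of this basic detect-then-decode shape, as discussed in the review below.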

This description was created by Ellipsis for 3f36e526a3a5a0f4668b4c53ff205afa3db26a33. You can customize this summary. It will automatically update as commits are pushed.

@changeset-bot

changeset-bot bot commented May 13, 2025

⚠️ No Changeset found

Latest commit: da3e39a025379e5fec86b15f957bb92579a8edf6

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@hannesrudolph hannesrudolph moved this from New to PR [Draft/WIP] in Roo Code Roadmap May 14, 2025
@aheizi aheizi marked this pull request as ready for review May 14, 2025 03:11
@aheizi aheizi requested review from cte and mrubens as code owners May 14, 2025 03:11
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. bug Something isn't working labels May 14, 2025
@aheizi aheizi marked this pull request as draft May 14, 2025 03:43
@aheizi aheizi marked this pull request as ready for review May 16, 2025 15:51
@hannesrudolph hannesrudolph moved this from New to PR [Pre Approval Review] in Roo Code Roadmap May 20, 2025
@hannesrudolph hannesrudolph moved this from PR [Needs Review] to TEMP in Roo Code Roadmap May 26, 2025
@daniel-lxs daniel-lxs moved this from TEMP to PR [Needs Review] in Roo Code Roadmap May 27, 2025
Member

@daniel-lxs daniel-lxs left a comment


Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.

Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with your custom logic. Could we simplify this to just use chardet's result?

Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.

Member


I noticed this alwaysTextExtensions array is also defined in extract-text.ts but with a different format (dots vs no dots). Should we maybe centralize this list somewhere to avoid duplication?

Contributor Author


OK, it has been extracted as a public constant.
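For illustration, the centralized list might look something like this (file name, constant name, and contents are hypothetical):

```typescript
// shared/text-extensions.ts (hypothetical module)
// Single source of truth, stored without dots; callers normalize before lookup.
export const ALWAYS_TEXT_EXTENSIONS = new Set(["txt", "md", "json", "ts", "js", "css", "html"])

export function isAlwaysTextExtension(ext: string): boolean {
	// Accept both ".md" and "md" so call sites can't disagree on format again.
	return ALWAYS_TEXT_EXTENSIONS.has(ext.replace(/^\./, "").toLowerCase())
}
```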

Member


Should we maybe log these errors for debugging? Silent failures could make it hard to troubleshoot encoding issues later.

Contributor Author


This catch block handles failures while attempting to decode the file with different encodings. That is an expected situation rather than a serious error: the code is designed to try multiple encodings until the best match is found, so it is normal for some of them to fail.
If logging is wanted, a debug log could be added, but I haven't found anywhere in the project where debug logs are currently used.
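A sketch of the try-and-continue pattern being described (names are illustrative; note that iconv-lite usually substitutes U+FFFD replacement characters rather than throwing, so the decoded output needs checking as well):

```typescript
import * as iconv from "iconv-lite"

// Try candidates in order; a failed attempt is expected, not a serious error.
function decodeFirstClean(buffer: Buffer, candidates: string[]): string | null {
	for (const encoding of candidates) {
		try {
			const text = iconv.decode(buffer, encoding)
			// iconv-lite inserts U+FFFD for undecodable bytes, so treat
			// replacement characters as a failed attempt too.
			if (!text.includes("\uFFFD")) return text
		} catch {
			// Unknown/unsupported encoding name: skip and try the next candidate.
		}
	}
	return null // caller picks the final fallback (e.g. plain UTF-8)
}
```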

Member


For large files, wouldn't decoding the entire buffer multiple times be slow? Have you considered just trusting chardet's detection and only falling back if that fails?

Contributor Author


Great point — decoding large buffers multiple times is definitely something to watch out for in terms of performance.

You’re absolutely right that for large files, it makes sense to avoid trying multiple encodings upfront. One improvement I’m planning is to set a file size threshold (e.g. 1MB):

  • For small files, we keep the current logic — try several likely encodings and score the result.
  • For large files, we'll first trust chardet and decode using its result. Only if the decoded content looks suspicious (e.g. low score, unreadable characters) will we fall back to trying a few alternatives like gb18030.

This way we preserve accuracy for tricky cases (e.g. GBK-encoded .js or .txt files that start with mostly ASCII), while avoiding unnecessary work for large files where chardet is usually good enough.

Happy to push this change if it sounds good to you.
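Sketched out, that plan might look roughly like this (the 1MB threshold and gb18030 fallback come from the comment above; scoreText is the PR's scoring idea, stubbed here, and the rest is illustrative):

```typescript
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Stub: the scoring heuristic is sketched later in this thread.
declare function scoreText(text: string): number

const LARGE_FILE_THRESHOLD = 1024 * 1024 // 1MB, per the proposal above

function decodeSmart(buffer: Buffer): string {
	const detected = jschardet.detect(buffer)?.encoding?.toLowerCase()

	// Large file: trust chardet and decode once; only re-check if suspicious.
	if (buffer.length > LARGE_FILE_THRESHOLD && detected && iconv.encodingExists(detected)) {
		const text = iconv.decode(buffer, detected)
		if (scoreText(text) >= 0.05) return text
	}

	// Small file (or suspicious large-file result): score a few likely encodings.
	const candidates = [detected, "utf-8", "gb18030", "shift_jis"].filter(
		(e): e is string => !!e && iconv.encodingExists(e),
	)
	let best = { text: buffer.toString("utf-8"), score: -Infinity }
	for (const encoding of candidates) {
		const text = iconv.decode(buffer, encoding)
		const score = scoreText(text)
		if (score > best.score) best = { text, score }
	}
	return best.text
}
```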

Member


How did you arrive at the 0.05 threshold? Have you tested this with different types of files to see if this value works well across various scenarios?

I'm curious about this custom scoring system, would just trusting chardet's detection be enough in this case?

Contributor Author


The scoring system was introduced as a safeguard against misdetections from chardet, especially in East Asian contexts. In practice, chardet often misclassifies GBK/GB18030 files as UTF-8 if the text is mostly ASCII (e.g. source code with only occasional Chinese comments). A simple confidence score from chardet doesn’t always reflect actual readability.

The scoreText function favors encodings that produce a reasonable amount of Chinese or full-width characters, and penalizes pure ASCII results. The 0.05 threshold came from empirical testing across a mix of file types:

  • UTF-8 Chinese content typically scores around 0.2–0.6
  • GBK files decoded incorrectly as UTF-8 usually get negative or near-zero scores
  • Pure ASCII text tends to score around -1
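One scorer consistent with those ranges would count CJK and full-width characters and penalize replacement characters, along these lines (a sketch matching the behavior described, not the PR's exact formula):

```typescript
// Favor CJK/full-width characters; penalize U+FFFD and pure-ASCII results.
function scoreText(text: string): number {
	if (text.length === 0) return -1

	let cjk = 0
	let replacement = 0
	for (const ch of text) {
		const code = ch.codePointAt(0)!
		if (code === 0xfffd) replacement++ // U+FFFD marks undecodable bytes
		// CJK Unified Ideographs, plus half-width and full-width forms.
		else if ((code >= 0x4e00 && code <= 0x9fff) || (code >= 0xff00 && code <= 0xffef)) cjk++
	}

	if (replacement > 0) return -replacement / text.length // garbled decode
	if (cjk === 0) return -1 // pure ASCII carries no signal either way
	return cjk / text.length // mixed code/Chinese text lands well above 0.05
}
```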

@aheizi
Contributor Author

aheizi commented May 31, 2025

> Hey @aheizi, Thank you for your contribution. I apologize we took so long to review your PR.
>
> Looking at the whole flow, it seems like we're doing encoding detection twice - once with chardet and then again with our custom logic. Could we simplify this to just use chardet's result?
>
> Thank you again for your contribution and patience, I'm looking forward to getting this PR ready for review.

Hi, @daniel-lxs Thank you for taking the time to review this PR!

You’re right — the flow involves detecting encoding with chardet, and then trying multiple candidate encodings including the one from chardet. The reason for this is that chardet’s detection can often be unreliable, especially for short or ambiguous files (e.g. GBK-encoded .js or .txt files that contain mostly ASCII). In such cases, decoding only with chardet’s top guess can lead to misinterpretation or mojibake.

The secondary scoring pass (via scoreText) helps us choose the most plausible decoding result among a few likely encodings, particularly prioritizing those common in Chinese or East Asian contexts (utf-8, gb18030, shift_jis, etc.).

@aheizi aheizi force-pushed the fix-file-encoding branch from da3e39a to 00b1261 Compare June 1, 2025 03:22
@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jun 2, 2025
@aheizi aheizi force-pushed the fix-file-encoding branch from 6156fea to 00ffe14 Compare June 3, 2025 02:29
@daniel-lxs
Member

Hi @aheizi, thanks for your work on fixing the file encoding issues. This is an important area to get right.

The current approach in readFileSmart uses several custom rules and thresholds (like the scoreText function, the 0.05 scoring limit, the list of text file extensions, and the special logic for large files) to guess the file encoding.

While this might work for the cases you've tested, relying on many custom rules like this can make the solution complex and potentially unreliable as we encounter different files or new situations in the future. It can also make the code harder to understand and maintain.

We need to find a simpler and more robust way to handle file encodings. For example, we should explore:

  • Relying more directly on chardet's detection capabilities and its reported confidence. If chardet is uncertain, we could have a very straightforward fallback (e.g., to UTF-8, or prompt the user if that's feasible).
  • Investigating if we can leverage VS Code's own encoding detection mechanisms, as it's generally quite good.

The aim is to have a solution that is dependable and easier to maintain, rather than a complex system of custom checks. A clear method for common encodings with a well-defined, simple fallback is generally preferred.

Could you explore these simpler approaches?
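For concreteness, the first suggestion in that list would reduce to something like this (a sketch; the 0.8 confidence cutoff is an arbitrary example, not a tested value):

```typescript
import * as jschardet from "jschardet"
import * as iconv from "iconv-lite"

// Trust chardet only when it is confident; otherwise fall back to UTF-8.
function decodeSimple(buffer: Buffer): string {
	const detected = jschardet.detect(buffer)
	if (
		detected?.encoding &&
		detected.confidence >= 0.8 && // example cutoff, would need tuning
		iconv.encodingExists(detected.encoding)
	) {
		return iconv.decode(buffer, detected.encoding)
	}
	return buffer.toString("utf-8")
}
```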

@daniel-lxs daniel-lxs moved this from PR [Needs Prelim Review] to PR [Draft / In Progress] in Roo Code Roadmap Jun 4, 2025
@daniel-lxs daniel-lxs marked this pull request as draft June 4, 2025 16:22
@aheizi
Contributor Author

aheizi commented Jun 4, 2025

> Hi @aheizi, thanks for your work on fixing the file encoding issues. This is an important area to get right.
>
> The current approach in readFileSmart uses several custom rules and thresholds (like the scoreText function, the 0.05 scoring limit, the list of text file extensions, and the special logic for large files) to guess the file encoding.
>
> While this might work for the cases you've tested, relying on many custom rules like this can make the solution complex and potentially unreliable as we encounter different files or new situations in the future. It can also make the code harder to understand and maintain.
>
> We need to find a simpler and more robust way to handle file encodings. For example, we should explore:
>
>   • Relying more directly on chardet's detection capabilities and its reported confidence. If chardet is uncertain, we could have a very straightforward fallback (e.g., to UTF-8, or prompt the user if that's feasible).
>   • Investigating if we can leverage VS Code's own encoding detection mechanisms, as it's generally quite good.
>
> The aim is to have a solution that is dependable and easier to maintain, rather than a complex system of custom checks. A clear method for common encodings with a well-defined, simple fallback is generally preferred.
>
> Could you explore these simpler approaches?

Hi @daniel-lxs , thanks a lot for your thoughtful feedback — I really appreciate it.

Regarding your suggestions:
1. VS Code’s encoding detection: I also initially considered using VS Code’s built-in encoding detection. However, based on my research, the extension API (e.g., vscode.workspace.openTextDocument) doesn’t actually auto-detect encodings. It defaults to UTF-8, so unfortunately it doesn’t help much in our case.
2. Using chardet directly: I tested this as well, but found that chardet performs poorly in distinguishing between UTF-8 and GBK — the two main encodings we need to handle. Its confidence scores in these cases are often too low or misleading, which makes it unreliable on its own.

Given those limitations, I opted for the current approach, which is admittedly more complex, but has worked reliably in the cases I tested. It includes some heuristics and fallback logic that aim to cover common scenarios while still falling back to UTF-8 if detection fails.

That said, I completely agree with the goal of simplifying this logic. If we can find a more robust and maintainable way to handle encoding detection — especially one that avoids custom heuristics — I’m absolutely open to revisiting the current implementation. For now, I think this version is a step forward in terms of correctness and can serve as a foundation we can refine.

Thanks again for the suggestions — I’d love to keep the discussion going if you have any further ideas.

@aheizi
Contributor Author

aheizi commented Jun 9, 2025

@daniel-lxs In the latest commit, I based the implementation on VS Code's own encoding handling: https://github.com/microsoft/vscode/blob/main/src/vs/workbench/services/textfile/common/encoding.ts
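That VS Code module begins by sniffing the byte-order mark before running jschardet heuristics; the BOM step itself is simple (a sketch of that first check, using the standard BOM byte sequences; the function name is illustrative):

```typescript
// Standard byte-order marks checked up front in VS Code's encoding.ts.
function detectEncodingByBOM(buffer: Buffer): string | null {
	if (buffer.length >= 3 && buffer[0] === 0xef && buffer[1] === 0xbb && buffer[2] === 0xbf) {
		return "utf-8"
	}
	if (buffer.length >= 2 && buffer[0] === 0xff && buffer[1] === 0xfe) {
		return "utf-16le"
	}
	if (buffer.length >= 2 && buffer[0] === 0xfe && buffer[1] === 0xff) {
		return "utf-16be"
	}
	return null // no BOM; fall through to heuristic detection
}
```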

@aheizi aheizi marked this pull request as ready for review June 9, 2025 10:53
@aheizi aheizi requested a review from jr as a code owner June 9, 2025 10:53
@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jun 9, 2025
Contributor


Consider using the built-in buffer.toString("latin1") instead of manually looping over the buffer in encodeLatin1 for improved performance and clarity.

Contributor Author


This mirrors a comment in the VS Code source explaining why the conversion is done manually:

// before guessing jschardet calls toString('binary') on input if it is a Buffer,
// since we are using it inside browser environment as well we do conversion ourselves
// https://github.com/aadsm/jschardet/blob/v2.1.1/src/index.js#L36-L40
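In Node the two are equivalent, since latin1 maps each byte 0-255 to the code unit with the same value; VS Code keeps the manual loop only because Buffer isn't available in its browser build. A small sketch of both:

```typescript
// Browser-safe byte-to-string conversion, mirroring Buffer's "latin1"/"binary"
// behavior: each byte becomes the UTF-16 code unit with the same value.
function encodeLatin1(bytes: Uint8Array): string {
	let result = ""
	for (let i = 0; i < bytes.length; i++) {
		result += String.fromCharCode(bytes[i])
	}
	return result
}

// Node-only equivalent, simpler and faster:
// const str = Buffer.from(bytes).toString("latin1")
```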

@aheizi aheizi force-pushed the fix-file-encoding branch from 3f36e52 to 3e079d6 Compare June 9, 2025 12:28
@daniel-lxs daniel-lxs moved this from PR [Draft / In Progress] to PR [Needs Prelim Review] in Roo Code Roadmap Jun 10, 2025
@daniel-lxs
Member

Hey @aheizi, thank you for taking the time to tackle this issue. Unfortunately, I don't think the implementation aligns with our goals: it adds quite a complex rating system to detect encoding, and that complexity makes it hard to test and maintain.

This doesn't mean your implementation is bad or that the issue is not important; we just want a simpler solution to this issue, if one exists.

I'll close this PR, but feel free to continue the discussion. I'll also gladly answer any questions you might have.

Thank you again!

@daniel-lxs daniel-lxs closed this Jun 12, 2025
@github-project-automation github-project-automation bot moved this from PR [Needs Prelim Review] to Done in Roo Code Roadmap Jun 12, 2025
@github-project-automation github-project-automation bot moved this from PR [Draft/WIP] to Done in Roo Code Roadmap Jun 12, 2025
SmartManoj pushed a commit to SmartManoj/Raa-Code that referenced this pull request Jun 13, 2025